Jess Goddard, Rich Pauloo, Ryan Shepherd, Noor Brody
Last updated 2022-03-27 22:23:14
In this report, we present a nationwide water service spatial
boundary layer for community water systems and explain the tiered
approach taken towards this end. This report builds on a previous
technical memorandum (eda_february.html) that should be
read as prerequisite material to understand the broader context,
background, and exploratory data analysis (EDA) that informs the
approach taken herein.
The resulting national water service boundary layer is the product of a “Tiered Explicit, Match, and Model” (henceforth, TEMM) approach. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity.
Below is a conceptual diagram of the three tiers in the TEMM approach. Tier 1 explicit boundaries always supersede Tier 2 matched proxy boundaries, which in turn always supersede Tier 3 modeled boundaries. Thus, the resulting water service boundary layer described in this report is a combination of all three tiers. Which tier applies to a given water system depends on data availability for that system and on whether or not it matches to a TIGER place.
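The precedence rule above can be sketched in a few lines of base R. This is a minimal illustration rather than the project's pipeline: the `pwsid` values and GEOIDs are hypothetical, while the column names `has_labeled_bound` and `tiger_match_geoid` are the ones used in the data preparation code later in this report.

```r
# Hypothetical toy systems; identifiers and GEOIDs are made up for illustration
systems <- data.frame(
  pwsid             = c("XX0000001", "XX0000002", "XX0000003"),
  has_labeled_bound = c(TRUE, FALSE, FALSE),
  tiger_match_geoid = c("0600001", "3200002", NA)
)

# Tier 1 wins whenever an explicit boundary exists, even if a TIGER match
# is also available; Tier 3 is the fallback when neither is present
systems$tier <- ifelse(
  systems$has_labeled_bound, "Tier 1",
  ifelse(!is.na(systems$tiger_match_geoid), "Tier 2", "Tier 3")
)

print(systems[, c("pwsid", "tier")])
```

Note that the first toy system has both an explicit boundary and a TIGER match, and is still assigned to Tier 1.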
In the sections that follow, we summarize our approach for each of the three tiers. Uncertainty increases as we move from Tier 1 to Tier 3, so we describe each tier in increasing detail and provide measures of validation and uncertainty for Tier 2 and Tier 3 estimates.
Finally, we encourage the reader to consider the TEMM water service boundary layer not as a final product, but rather, one that may be improved with the assimilation of additional Tier 1 explicit data, improvements to the Tier 2 matching algorithm, and refinement of the Tier 3 model. Ultimately, this entire workflow may be superseded by nationwide Tier 1 data, which would reduce the problem we address in the scope of work to one of simple Tier 1 data assimilation and cleaning.
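library(tidyverse)
library(sf)
library(fs)
library(here) # provides here(), used for project-relative paths below
library(knitr) # provides kable() and include_graphics(), used below
library(rcartocolor)
library(geofacet)
library(mapview)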
# don't allow fgb streaming: delivers self-contained html
mapviewOptions(fgb = FALSE)
# load environmental variable for staging path
staging_path <- Sys.getenv("WSB_STAGING_PATH")
# read TEMM spatial output for the key result summary stats below
temm <- here("temm_layer/2022-03-27_temm.geojson") %>%
st_read(quiet = TRUE) %>%
# drop this column because it's read as a list and causes mapview trouble
select(-service_area_type_code)
# Tier 1 labeled data
wsb_labeled_clean <- path(staging_path, "wsb_labeled_clean.geojson") %>%
st_read(quiet = TRUE)
# service connection count cutoff for community water systems
n_max <- 15
# read the clean matched output and perform minor transforms for plots.
# this has the same number of rows as `temm`, but contains Tier 1 "radius"
d <- read_csv(path(staging_path, "matched_output_clean.csv")) %>%
mutate(
# transform radius from m to km
radius = radius/1000,
# indicate the tier to use for each pwsid
tier = case_when(
has_labeled_bound == TRUE ~ "Tier 1",
has_labeled_bound == FALSE & !is.na(tiger_match_geoid) ~ "Tier 2",
has_labeled_bound == FALSE & is.na(tiger_match_geoid) ~ "Tier 3"
)
) %>%
# filter to CWS and assume each connection must serve at least 1 person
# this drops 267 rows (0.5% of data)
filter(service_connections_count >= n_max,
population_served_count >= n_max) %>%
# remove 834 rows (1.5% of data) outside the 50 US states in `state.abb`, mostly Puerto Rico
filter(primacy_agency_code %in% state.abb)
# population served by all water systems
pop_total <- sum(d$population_served_count)
# calculate count and proportion of people served by each tier
pop <- d %>%
group_by(tier) %>%
summarize(
count = format_bign(sum(population_served_count)),
prop = round((sum(population_served_count)/pop_total)*100, 2)
)
The following outline reflects a summary of key findings, followed by a description of each Tier in the TEMM approach. Notably, the TEMM may be refined and re-run as new data sources are ingested or improvements are made to matching and modeling algorithms.
Key Results
Tier 1: Explicit boundaries
Tier 2: Matched TIGER proxy boundaries
Tier 3: Modeled boundaries
Contributions
Recommendations
The key result of this study is a nationwide water service boundary layer. Here we show the proportion of population served by each TEMM Tier at nationwide and statewide scales.
In total, the TEMM data layer represents tap water delivery to 309.53 million people served by 45,564 water systems.
About 122 million people are covered by Tier 1 spatial data, which is impressive given that only 12 states provided explicit boundary data. However, this relatively high coverage rate is unsurprising because these states (AZ, CA, CT, KS, MO, NC, NJ, NM, OK, PA, TX, WA) include high-population states such as CA, TX, and PA.
Together, around 285 million people (92.19% of the population) are covered by either a Tier 1 or a Tier 2 spatial boundary. The remaining approximately 24 million people (7.81%) are covered by a Tier 3 boundary. These results indicate high confidence in the spatial accuracy of the resulting TEMM water boundary layer for community water systems.
pop %>%
kable(col.names = c("Tier", "Population count",
"Population proportion (%)")) %>%
kableExtra::kable_styling(full_width = FALSE)
| Tier | Population count | Population proportion (%) |
|---|---|---|
| Tier 1 | 122,360,150 | 39.53 |
| Tier 2 | 163,009,796 | 52.66 |
| Tier 3 | 24,161,416 | 7.81 |
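As a quick consistency check, the headline figures quoted above can be reproduced from the three counts in this table alone. The sketch below hard-codes those counts (copied from the table) and does not touch the underlying data; the variable names are illustrative.

```r
# Population counts per tier, copied from the table above
tier_pop <- c("Tier 1" = 122360150, "Tier 2" = 163009796, "Tier 3" = 24161416)

# Total population served, in people and in millions
pop_total <- sum(tier_pop)
pop_total_millions <- round(pop_total / 1e6, 2)

# Percentage of the total population covered by each tier
tier_pct <- round(tier_pop / pop_total * 100, 2)

# People covered by either an explicit (Tier 1) or matched (Tier 2) boundary
explicit_or_matched <- sum(tier_pop[c("Tier 1", "Tier 2")])
explicit_or_matched_pct <- round(explicit_or_matched / pop_total * 100, 2)

print(tier_pct)
print(c(total_millions = pop_total_millions,
        tier_1_or_2_pct = explicit_or_matched_pct))
```

The total comes to 309,531,362 people (309.53 million), and Tiers 1 and 2 together cover 285,369,946 people (92.19%), matching the figures in the text.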
Next, we show the proportion of population covered per TEMM Tier on a state-by-state basis. Notably:
# dataframe for barplot geofacet: population prop served by each tier
dg <- d %>%
group_by(state_code) %>%
# population per state
mutate(state_pop = sum(population_served_count)) %>%
ungroup() %>%
group_by(state_code, tier) %>%
# calculate count/proportion of population served per state and tier
summarise(
count = sum(population_served_count),
# state_pop is constant within a state, so use its first value to return one row per group
prop = count/first(state_pop)
) %>%
ungroup() %>%
distinct() %>%
# add missing NA values per state code and tier group
complete(state_code, tier) %>%
mutate(tier = factor(tier, levels = paste("Tier", 3:1)))
# sanity check the grouped summary above: this should return all 1
# group_by(dg, state_code) %>%
# summarise(s = sum(prop, na.rm = TRUE)) %>%
# pull(s)
# geofacet of TEMM tier coverage per state in terms of population proportion
dg %>%
ggplot(aes(tier, prop, fill = tier)) +
geom_col() +
coord_flip() +
scale_fill_carto_d(direction = -1) +
scale_y_continuous(breaks = c(0, 0.5, 1),
labels = c(0, 0.5, 1)) +
geofacet::facet_geo(~state_code) +
labs(x = "", y = "Proportion of Population Covered") +
guides(fill = "none") +
theme_minimal(base_size = 6) +
theme(panel.grid.minor.x = element_blank())
A data table of the above plot is provided below.
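To make the per-state calculation behind the plot above easier to follow, here is a minimal sketch on hypothetical toy data. The state codes "AA" and "BB" and all counts are made up; only the column names mirror those used in this report. It assumes the `dplyr` and `tidyr` packages are installed, and it fills missing state and tier combinations with zeros directly in `complete()`, which is one possible choice rather than the approach taken in the report's code.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two made-up states, five water systems
toy <- tibble(
  state_code = c("AA", "AA", "AA", "BB", "BB"),
  tier = c("Tier 1", "Tier 2", "Tier 2", "Tier 2", "Tier 3"),
  population_served_count = c(600, 300, 100, 500, 500)
)

toy_prop <- toy %>%
  # total population served per state
  group_by(state_code) %>%
  mutate(state_pop = sum(population_served_count)) %>%
  # population count and share per state and tier
  group_by(state_code, tier) %>%
  summarise(
    count = sum(population_served_count),
    prop = count / first(state_pop),
    .groups = "drop"
  ) %>%
  # add state and tier combinations with no systems, filled with zeros
  complete(state_code, tier, fill = list(count = 0, prop = 0))

print(toy_prop)
```

Each state's proportions sum to 1, which is the same sanity check noted in the commented-out code above.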
dg %>%
mutate(
count = ifelse(is.na(count), 0, count),
count = format_bign(count),
prop = ifelse(is.na(prop), 0, prop),
prop = round(prop, 3)
) %>%
dt_make()
The national water service boundary layer is too large (around 400 MB) to plot in this interactive report. We therefore show a static map below to illustrate the coverage provided by Tiers 1-3 in the proportions described above, and direct the reader to the TEMM spatial layer in the project repository.
include_graphics(here("src/analysis/sandbox/model_explore/etc/temm-nation.png"))
Here, we zoom into California, Nevada, and Oregon – three states with high proportions of Tier 1, 2, and 3 spatial boundaries respectively. The Tier 3 circular buffers in this interactive map represent median (50th percentile) model estimates. Note that when clicking on polygons in the map below, only Tier 2 and 3 data
# plot CA, NV, OR
temm %>%
filter(primacy_agency_code %in% c("CA", "NV", "OR")) %>%
select(pwsid, service_connections_count, tier) %>%
mapview::mapview(zcol = "tier", burst = TRUE)